
    TensorFlow on state-of-the-art HPC clusters: a machine learning use case

    The recent rapid growth of the data-flow programming paradigm has enabled the development of domain-specific architectures, e.g., for machine learning. The best-known example is the Tensor Processing Unit (TPU) by Google. Standard data centers, however, still cannot devote large partitions to machine-learning-specific architectures. Within data centers, High-Performance Computing (HPC) clusters are highly parallel machines targeting a broad class of compute-intensive workflows; as such, they can be used to tackle machine learning challenges. On top of this, HPC architectures are changing rapidly, incorporating accelerators and instruction sets beyond the classical x86 CPUs. In this blurry scenario, identifying the best hardware/software configurations to efficiently support machine learning workloads on HPC clusters is not trivial. In this paper, we consider the TensorFlow workflow for image recognition. We highlight the strong dependency of training-phase performance on the availability of arithmetic libraries optimized for the underlying architecture. Following the example of Intel leveraging the MKL libraries to improve TensorFlow performance, we plugged the Arm Performance Libraries into TensorFlow and tested it on an HPC cluster based on Marvell ThunderX2 CPUs. We also performed a scalability study on three state-of-the-art HPC clusters based on different CPU architectures: x86 Intel Skylake, Arm-v8 Marvell ThunderX2, and PowerPC IBM Power9.

    MPI+X: task-based parallelization and dynamic load balance of finite element assembly

    The main computing tasks of a finite element (FE) code for solving partial differential equations (PDEs) are the algebraic system assembly and the iterative solver. This work focuses on the first task, in the context of a hybrid MPI+X paradigm. Although we describe the algorithms in the FE context, a similar strategy can be applied straightforwardly to other discretization methods, such as the finite volume method. The matrix assembly consists of a loop over the elements of the MPI partition to compute element matrices and right-hand sides, and of their assembly into the system local to each MPI partition. In a hybrid MPI+X parallelism context, X has traditionally consisted of loop parallelism using OpenMP. Several strategies have been proposed in the literature to implement this loop parallelism, such as coloring or substructuring techniques, to circumvent the race condition that appears when assembling the element system into the local system. The main drawback of the first technique is a decrease in IPC due to poor spatial locality. The second technique avoids this issue but requires extensive changes to the implementation, which can be cumbersome when several element loops must be treated. We propose an alternative based on task parallelism of the element loop, using some extensions to the OpenMP programming model. The taskification of the assembly solves both of the aforementioned problems. In addition, dynamic load balance is applied using the DLB library, which is especially efficient in the presence of hybrid meshes, where the relative costs of the different elements are impossible to estimate a priori. This paper presents the proposed methodology, its implementation, and its validation through the solution of large computational mechanics problems on up to 16k cores.

    Computational Fluid and Particle Dynamics Simulations for Respiratory System: Runtime Optimization on an Arm Cluster

    Computational fluid and particle dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes demand high-performance computing (HPC) resources. For these reasons, in this paper we introduce and evaluate system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on both Intel- and Arm-based HPC clusters, showing the importance of mechanisms applied at runtime to improve performance independently of the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x while keeping the computational resources constant. This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697).

    Dynamic load balancing for hybrid applications

    DLB relies on the use of hybrid programming models and exploits the malleability of the second level of parallelism to redistribute computing power across processes.

    Lessons learned from a performance analysis and optimization of a multiscale cellular simulation

    This work presents a comprehensive performance analysis and optimization of a multiscale agent-based cellular simulation. The optimizations applied are guided by detailed performance analysis and include memory management, load balance, and a locality-aware parallelization. The outcome of this paper is not only the speedup of 2.4x achieved by the optimized version with respect to the original PhysiCell code, but also the lessons learned and best practices for developing parallel HPC codes to obtain efficient and highly performant applications, especially in the computational biology field.

    Runtime Mechanisms to Survive New HPC Architectures: A Use-Case in Human Respiratory Simulations

    Computational Fluid and Particle Dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes demand high-performance computing (HPC) resources. For these reasons, in this paper we introduce and evaluate system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on Intel-, IBM-, and Arm-based HPC technologies ranked in the Top500 supercomputers, showing the importance of mechanisms applied at runtime to improve performance independently of the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x across different architectures when applying the runtime techniques, while keeping the computational resources constant. This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697).

    Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations

    Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and for modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed specifically for studying plasma-material interaction in fusion devices. Its most salient characteristic is the inclusion of collision Monte Carlo models for different plasma species. In this work, we characterize the single-node, multi-node, and I/O performance of the BIT1 code in two realistic cases using several HPC profilers, such as perf, IPM, Extrae/Paraver, and the Darshan tools. We find that the on-node performance of the BIT1 sorting function is the main performance bottleneck. Strong scaling tests show a parallel efficiency of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance, and self-synchronization are important factors impacting the performance of BIT1 in large-scale runs. Accepted at the Euro-Par 2023 workshops (TDLPP 2023); prepared in the standardized Springer LNCS format, 12 pages including the main text, references, and figures.

    Bronchial Aspirate-Based Profiling Identifies MicroRNA Signatures Associated With COVID-19 and Fatal Disease in Critically Ill Patients

    Background: The pathophysiology of COVID-19-related critical illness is not completely understood. Here, we analyzed the microRNA (miRNA) profile of bronchial aspirate (BAS) samples from COVID-19 and non-COVID-19 patients admitted to the ICU to identify prognostic biomarkers of fatal outcomes and to define the molecular pathways involved in the disease and its adverse events. Methods: Two patient populations were included (n = 89): (i) a study population composed of critically ill COVID-19 and non-COVID-19 patients; (ii) a prospective study cohort composed of COVID-19 survivors and non-survivors among patients assisted by invasive mechanical ventilation (IMV). BAS samples were obtained by bronchoaspiration during the ICU stay. The miRNA profile was analyzed using RT-qPCR. Detailed biomarker and bioinformatics analyses were performed. Results: Deregulation of five miRNA ratios (miR-122-5p/miR-199a-5p, miR-125a-5p/miR-133a-3p, miR-155-5p/miR-486-5p, miR-214-3p/miR-222-3p, and miR-221-3p/miR-27a-3p) was observed when COVID-19 and non-COVID-19 patients were compared. In addition, five miRNA ratios segregated ICU survivors from non-survivors (miR-1-3p/miR-124-3p, miR-125b-5p/miR-34a-5p, miR-126-3p/miR-16-5p, miR-199a-5p/miR-9-5p, and miR-221-3p/miR-491-5p). Through multivariable analysis, we constructed a miRNA ratio-based prediction model for ICU mortality that optimized the combination of miRNA ratios (miR-125b-5p/miR-34a-5p, miR-199a-5p/miR-9-5p, and miR-221-3p/miR-491-5p). The model (AUC 0.85) and the miR-199a-5p/miR-9-5p ratio alone (AUC 0.80) showed optimal discrimination and outperformed the best clinical predictor of ICU mortality (days from first symptoms to IMV initiation, AUC 0.73). The survival analysis confirmed the usefulness of both the miRNA ratio model and the individual ratio for identifying patients at high risk of fatal outcomes following IMV initiation.
Functional enrichment analyses identified pathological mechanisms implicated in fibrosis, coagulation, viral infections, immune responses, and inflammation. Conclusions: COVID-19 induces a specific miRNA signature in BAS from critically ill patients. In addition, specific miRNA ratios in BAS samples hold individual and collective potential to improve risk-based patient stratification following IMV initiation in COVID-19-related critical illness. The biological role of the host miRNA profiles may allow a better understanding of the different pathological axes of the disease. We particularly want to acknowledge the patients, the Biobank IdISBa, and the CIBERES Pulmonary Biobank Consortium (PT17/0015/0001), a member of the Spanish National Biobanks Network financed by the Carlos III Health Institute, with the participation of the Intensive Care, Clinical Analysis, and Pulmonology Units of Hospital Universitario Son Espases and Hospital Son Llatzer, for their collaboration. This work was also supported by IRBLleida Biobank (B.0000682) and Plataforma Biobancos PT17/0015/0027. Article signed by 25 authors: Marta Molinero, Iván D. Benítez, Jessica González, Clara Gort-Paniello, Anna Moncusí-Moix, Fátima Rodríguez-Jara, María C. García-Hidalgo, Gerard Torres, J. J. Vengoechea, Silvia Gómez, Ramón Cabo, Jesús Caballero, Jesús F. Bermejo-Martin, Adrián Ceccato, Laia Fernández-Barat, Ricard Ferrer, Dario Garcia-Gasulla, Rosario Menéndez, Ana Motos, Oscar Peñuelas, Jordi Riera, Antoni Torres, Ferran Barbé, and David de Gonzalo-Calvo on behalf of the CIBERESUCICOVID Project (COV20/00110 ISCIII).

    Optimization of condensed matter physics application with OpenMP tasking model

    The Density Matrix Renormalization Group code (DMRG++) is a condensed matter physics application used to study the superconductivity properties of materials. Its main computation consists of calculating the Hamiltonian matrix, which requires sparse matrix-vector multiplications. This paper presents task-based parallelization and optimization strategies for the Hamiltonian algorithm. The algorithm is implemented as a mini-application in C++ and parallelized with OpenMP. The optimization leverages tasking features, such as the dependencies and priorities included in the OpenMP 4.5 standard. The code refactoring targets programmability as much as performance. The optimized version achieves a speedup of 8.0x with 8 threads and 20.5x with 40 threads on a Power9 computing node, while reducing the memory consumption to 90 MB with respect to the original code, by adding fewer than ten OpenMP directives. This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV2015-0493), by the Spanish Ministry of Science and Technology (project TIN2015-65316-P), by the Generalitat de Catalunya (contract 2017-SGR-1414), and by the BSC-IBM Deep Learning Research Agreement, under JSA “Application porting, analysis and optimization for POWER and POWER AI”. This work was partially supported by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences, Division of Materials Sciences and Engineering. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.